This report uses the following dataset for the Math2319 (Machine Learning) Phase 1 Assignment.
The objective of the House Rent Prediction dataset is to predict the monthly rent of available homes from various explanatory variables describing aspects of residential houses. It is sourced from Kaggle.com.
The dataset comprises 265190 house records with 22 columns.
The report showcases the following results:
The dataset is named House Rent Prediction and comprises 265190 house records with 22 columns.
It is sourced from Kaggle (house-rent-prediction-dataset. (2021). Retrieved 11 April 2021, from https://www.kaggle.com/rkb0023/houserentpredictiondataset).
The main aim of this report is to import the data, clean it so that it can be used in machine learning algorithms to predict the monthly rent, describe the features of the data, and show its details through data visualizations.
The eventual goal is a model that predicts the monthly rent from the other variables. The target feature is the price variable, which is later renamed to rent price.
This dataset contains price information for different types of houses listed on Craigslist in various regions of the United States.
The dataset has 265190 house records with 22 columns. The columns are listed below:
id: listing id
url: listing URL
region: craigslist region
region_url: region URL
price: rent per month (Target Column)
type: housing type
sqfeet: total square footage
beds: number of beds
baths: number of bathrooms
cats_allowed: cats allowed boolean (1 = yes, 0 = no)
dogs_allowed: dogs allowed boolean (1 = yes, 0 = no)
smoking_allowed: smoking allowed boolean (1 = yes, 0 = no)
wheelchair_access: has wheelchair access boolean (1 = yes, 0 = no)
electric_vehicle_charge: has electric vehicle charger boolean (1 = yes, 0 = no)
comes_furnished: comes with furniture boolean (1 = yes, 0 = no)
laundry_options: laundry options available (categorical, e.g. 'w/d in unit', 'laundry on site')
parking_options: parking options available (categorical, e.g. 'off-street parking', 'carport')
image_url: image URL
description: description by poster
lat: latitude
long: longitude
state: state of listing
All the descriptive features listed above are self-explanatory.
To begin data pre-processing, we have to import certain modules in Python.
The warnings module is imported so that warnings can be ignored during pre-processing.
Next, we import pandas and numpy, which are very helpful in cleaning the dataset. Then matplotlib.pyplot, seaborn, altair and plotly are imported to create the plots, and the notebook magic commands for inline plotting are also set.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
# for plotting
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
pd.set_option('display.max_columns', None)
import seaborn as sns
sns.set()
import altair as alt
Here we import the dataset from the CSV file housing_train.csv and use head() to display the first few observations. The dataset is named df.
df = pd.read_csv('./housing_train.csv')
df.head()
| | id | url | region | region_url | price | type | sqfeet | beds | baths | cats_allowed | dogs_allowed | smoking_allowed | wheelchair_access | electric_vehicle_charge | comes_furnished | laundry_options | parking_options | image_url | description | lat | long | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7039061606 | https://bham.craigslist.org/apa/d/birmingham-h... | birmingham | https://bham.craigslist.org | 1195 | apartment | 1908 | 3 | 2.0 | 1 | 1 | 1 | 0 | 0 | 0 | laundry on site | street parking | https://images.craigslist.org/00L0L_80pNkyDeG0... | Apartments In Birmingham AL Welcome to 100 Inv... | 33.4226 | -86.7065 | al |
| 1 | 7041970863 | https://bham.craigslist.org/apa/d/birmingham-w... | birmingham | https://bham.craigslist.org | 1120 | apartment | 1319 | 3 | 2.0 | 1 | 1 | 1 | 0 | 0 | 0 | laundry on site | off-street parking | https://images.craigslist.org/00707_uRrY9CsNMC... | Find Your Way to Haven Apartment Homes Come ho... | 33.3755 | -86.8045 | al |
| 2 | 7041966914 | https://bham.craigslist.org/apa/d/birmingham-g... | birmingham | https://bham.craigslist.org | 825 | apartment | 1133 | 1 | 1.5 | 1 | 1 | 1 | 0 | 0 | 0 | laundry on site | street parking | https://images.craigslist.org/00h0h_b7Bdj1NLBi... | Apartments In Birmingham AL Welcome to 100 Inv... | 33.4226 | -86.7065 | al |
| 3 | 7041966936 | https://bham.craigslist.org/apa/d/birmingham-f... | birmingham | https://bham.craigslist.org | 800 | apartment | 927 | 1 | 1.0 | 1 | 1 | 1 | 0 | 0 | 0 | laundry on site | street parking | https://images.craigslist.org/00808_6ghZ8tSRQs... | Apartments In Birmingham AL Welcome to 100 Inv... | 33.4226 | -86.7065 | al |
| 4 | 7041966888 | https://bham.craigslist.org/apa/d/birmingham-2... | birmingham | https://bham.craigslist.org | 785 | apartment | 1047 | 2 | 1.0 | 1 | 1 | 1 | 0 | 0 | 0 | laundry on site | street parking | https://images.craigslist.org/00y0y_21c0FOvUXm... | Apartments In Birmingham AL Welcome to 100 Inv... | 33.4226 | -86.7065 | al |
The df.shape attribute shows the shape of the dataset. Here, we can observe that the dataset contains 265190 rows and 22 columns.
df.shape
(265190, 22)
As the dataset contains more than 5000 rows, we take a subset of it. Here, we take a random sample of 5000 rows by using the sample() function with a fixed random_state. The new subset is named newdf.
newdf = df.sample(5000, random_state=999)
We use the print function to display the number of rows and columns of the subset. We can see that there are exactly 5000 rows and 22 columns in the new subset created by sample().
print(newdf.shape)
(5000, 22)
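The sampling step above can be illustrated on a small toy frame. This is only a minimal sketch: the frame toy and its values are made up and are not the assignment data. It suggests two things: the same random_state returns the same rows each time, and sample() keeps the original row labels, which is why the index values in the later tables are not 0, 1, 2 and so on.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (values are made up)
toy = pd.DataFrame({"price": range(100, 200)})

# Two samples drawn with the same seed
sample_a = toy.sample(10, random_state=999)
sample_b = toy.sample(10, random_state=999)

print(sample_a.equals(sample_b))    # same seed, same rows
print(sample_a.index.tolist())      # original row labels are kept

# Optional: renumber the rows from 0 if the old labels are not needed
renumbered = sample_a.reset_index(drop=True)
print(renumbered.index.tolist())
```

Resetting the index is optional; the report keeps the original labels.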
The dtypes attribute shows the data types in the subset. Here we can see that there are 9 categorical (object) variables, 10 integer variables and 3 float variables.
print(newdf.dtypes)
id                           int64
url                         object
region                      object
region_url                  object
price                        int64
type                        object
sqfeet                       int64
beds                         int64
baths                      float64
cats_allowed                 int64
dogs_allowed                 int64
smoking_allowed              int64
wheelchair_access            int64
electric_vehicle_charge      int64
comes_furnished              int64
laundry_options             object
parking_options             object
image_url                   object
description                 object
lat                        float64
long                       float64
state                       object
dtype: object
Here we display the number of missing values in the dataset. We can see that there are 1022 missing values in laundry_options, 1819 in parking_options, and 26 each in the lat and long variables.
print(f"\nNumber of missing values:")
print(newdf.isnull().sum())
Number of missing values:
id                            0
url                           0
region                        0
region_url                    0
price                         0
type                          0
sqfeet                        0
beds                          0
baths                         0
cats_allowed                  0
dogs_allowed                  0
smoking_allowed               0
wheelchair_access             0
electric_vehicle_charge       0
comes_furnished               0
laundry_options            1022
parking_options            1819
image_url                     0
description                   0
lat                          26
long                         26
state                         0
dtype: int64
# fraction of missing values per column (multiply by 100 for %)
newdf.isna().mean().sort_values(ascending=False)
parking_options            0.3638
laundry_options            0.2044
lat                        0.0052
long                       0.0052
state                      0.0000
baths                      0.0000
url                        0.0000
region                     0.0000
region_url                 0.0000
price                      0.0000
type                       0.0000
sqfeet                     0.0000
beds                       0.0000
dogs_allowed               0.0000
cats_allowed               0.0000
smoking_allowed            0.0000
wheelchair_access          0.0000
electric_vehicle_charge    0.0000
comes_furnished            0.0000
image_url                  0.0000
description                0.0000
id                         0.0000
dtype: float64
parking_options has about 36% missing values and laundry_options about 20%, while the lat and long variables each have about 0.5% missing values.
To proceed with data visualization, and later with building the machine learning model, we have to deal with these missing values; otherwise they will cause problems in later steps.
There are various ways to deal with them, such as imputing with the mean, median or mode, or simply removing them from the dataset. Here, we use the dropna() function to remove the rows containing missing values and assign the result to a new variable, new_df.
new_df = newdf.dropna()
new_df.shape
(3065, 22)
Above, we can see that new_df contains 3065 rows and 22 columns, down from 5000 rows and 22 columns. This shows that all the rows containing missing values have been removed from our dataset.
Just to be sure, we use isnull().sum() to check whether any missing values are still present in our dataset.
print(f"\nNumber of missing values:")
print(new_df.isnull().sum())
Number of missing values:
id                         0
url                        0
region                     0
region_url                 0
price                      0
type                       0
sqfeet                     0
beds                       0
baths                      0
cats_allowed               0
dogs_allowed               0
smoking_allowed            0
wheelchair_access          0
electric_vehicle_charge    0
comes_furnished            0
laundry_options            0
parking_options            0
image_url                  0
description                0
lat                        0
long                       0
state                      0
dtype: int64
As we can see, there are no missing values remaining in our dataset.
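For comparison, the imputation alternative mentioned earlier (mode for categorical columns, median for numeric ones) could look roughly like the sketch below. This is not what the report does; it only illustrates the idea on a made-up toy frame, and the variable names toy and imputed are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for newdf (values are made up)
toy = pd.DataFrame({
    "laundry_options": ["w/d in unit", None, "laundry on site", "w/d in unit"],
    "parking_options": [None, "carport", "carport", None],
    "lat": [33.42, np.nan, 33.38, 33.40],
})

imputed = toy.copy()

# Categorical columns: fill gaps with the most frequent value (the mode)
for col in ["laundry_options", "parking_options"]:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

# Numeric column: fill gaps with the median
imputed["lat"] = imputed["lat"].fillna(imputed["lat"].median())

print(imputed.isnull().sum())      # no missing values remain
print(len(imputed), len(toy))      # no rows were dropped
```

Unlike dropna(), this keeps every row, at the cost of filling in values that were never observed. With about 36% of parking_options missing, mode imputation would change the distribution of that column noticeably, so the choice is a trade-off.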
To proceed further, we remove the irrelevant columns from our dataset by using the drop() function and name the result Freshdata.
#dropping columns
Freshdata = new_df.drop(['id','url','region','region_url','image_url','description','state','lat','long'], axis=1)
We removed the id, url, region, region_url, image_url, description, state, lat and long columns as they were of no use to us.
We use the columns attribute to display the columns available in our new dataset.
Freshdata.columns
Index(['price', 'type', 'sqfeet', 'beds', 'baths', 'cats_allowed',
'dogs_allowed', 'smoking_allowed', 'wheelchair_access',
'electric_vehicle_charge', 'comes_furnished', 'laundry_options',
'parking_options'],
dtype='object')
As we can see, all the irrelevant columns have been removed from our dataset.
Now we rename the columns for convenience and name the resulting dataset Finaldf for further use.
Finaldf = Freshdata.rename(columns={'price':'rent price','type':'type','sqfeet':'sqfeet', 'beds':'beds', 'baths':'baths',
'cats_allowed':'cats allowed','dogs_allowed':'dogs allowed','smoking_allowed':'smoking allowed',
'wheelchair_access':'wheelchair access','electric_vehicle_charge':'electric vehicle charge',
'comes_furnished':'furnished', 'parking_options':'parking options',
'laundry_options':'Laundry options'})
Below we use the columns attribute again to check the new column names.
Finaldf.columns
Index(['rent price', 'type', 'sqfeet', 'beds', 'baths', 'cats allowed',
'dogs allowed', 'smoking allowed', 'wheelchair access',
'electric vehicle charge', 'furnished', 'Laundry options',
'parking options'],
dtype='object')
To check the first 5 rows of our cleaned dataset, we use the head() function.
Finaldf.head()
| | rent price | type | sqfeet | beds | baths | cats allowed | dogs allowed | smoking allowed | wheelchair access | electric vehicle charge | furnished | Laundry options | parking options |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231254 | 1070 | apartment | 1336 | 3 | 2.0 | 1 | 1 | 1 | 0 | 0 | 0 | w/d in unit | off-street parking |
| 39715 | 1300 | apartment | 760 | 1 | 1.0 | 1 | 1 | 0 | 1 | 0 | 0 | laundry on site | carport |
| 61857 | 995 | apartment | 740 | 2 | 1.0 | 1 | 1 | 1 | 0 | 0 | 0 | laundry on site | off-street parking |
| 1984 | 785 | apartment | 1100 | 2 | 2.0 | 1 | 1 | 1 | 0 | 0 | 0 | laundry on site | off-street parking |
| 145290 | 650 | apartment | 750 | 1 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | laundry in bldg | detached garage |
To check the last 5 rows of our cleaned dataset, we use the tail() function.
Finaldf.tail()
| | rent price | type | sqfeet | beds | baths | cats allowed | dogs allowed | smoking allowed | wheelchair access | electric vehicle charge | furnished | Laundry options | parking options |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 194620 | 400 | apartment | 1100 | 2 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | w/d hookups | off-street parking |
| 65529 | 1685 | apartment | 1600 | 3 | 2.0 | 1 | 1 | 1 | 0 | 0 | 0 | w/d in unit | off-street parking |
| 141086 | 1102 | apartment | 1052 | 2 | 2.0 | 1 | 1 | 0 | 0 | 0 | 0 | w/d in unit | attached garage |
| 195110 | 725 | apartment | 1015 | 2 | 1.0 | 1 | 0 | 1 | 0 | 0 | 0 | w/d in unit | detached garage |
| 10266 | 875 | apartment | 555 | 1 | 1.0 | 1 | 1 | 1 | 0 | 0 | 0 | laundry in bldg | carport |
To check the information about Finaldf, we use the info() function.
Finaldf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3065 entries, 231254 to 10266
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   rent price               3065 non-null   int64
 1   type                     3065 non-null   object
 2   sqfeet                   3065 non-null   int64
 3   beds                     3065 non-null   int64
 4   baths                    3065 non-null   float64
 5   cats allowed             3065 non-null   int64
 6   dogs allowed             3065 non-null   int64
 7   smoking allowed          3065 non-null   int64
 8   wheelchair access        3065 non-null   int64
 9   electric vehicle charge  3065 non-null   int64
 10  furnished                3065 non-null   int64
 11  Laundry options          3065 non-null   object
 12  parking options          3065 non-null   object
dtypes: float64(1), int64(9), object(3)
memory usage: 335.2+ KB
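Above we can see that we have 1 float variable, 9 integer variables and 3 object variables in our dataset.
Below we display the summary of our integer (int64) features, which Table 1 labels as continuous. Note that this selection also includes beds and the 0/1 flag columns.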
from IPython.display import display, HTML
display(HTML('<b>Table 1: Summary of continuous features</b>'))
Finaldf.describe(include='int64')
| | rent price | sqfeet | beds | cats allowed | dogs allowed | smoking allowed | wheelchair access | electric vehicle charge | furnished |
|---|---|---|---|---|---|---|---|---|---|
| count | 3065.000000 | 3065.000000 | 3065.000000 | 3065.000000 | 3065.000000 | 3065.000000 | 3065.000000 | 3065.000000 | 3065.000000 |
| mean | 1291.457096 | 1022.322349 | 1.916476 | 0.758891 | 0.732790 | 0.659706 | 0.093964 | 0.017618 | 0.058401 |
| std | 773.531127 | 496.764118 | 0.897963 | 0.427826 | 0.442575 | 0.473885 | 0.291826 | 0.131581 | 0.234539 |
| min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 839.000000 | 750.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 1102.000000 | 959.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1550.000000 | 1165.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 15750.000000 | 13060.000000 | 7.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
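Because Table 1 is selected by dtype, it mixes genuinely numeric measurements with 0/1 flags. One possible alternative, sketched here on a made-up toy frame, is to name the numeric columns explicitly. The list continuous_cols is our own assumption about which columns count as continuous or count-like; it is not taken from the assignment.

```python
import pandas as pd

# Hypothetical toy frame standing in for Finaldf (values are made up)
toy = pd.DataFrame({
    "rent price": [800, 1000, 1200, 1500],
    "sqfeet": [600, 900, 1000, 1300],
    "beds": [1, 2, 2, 3],
    "baths": [1.0, 1.0, 1.5, 2.0],
    "cats allowed": [1, 0, 1, 1],   # 0/1 flag, left out of the summary
})

# Assumed list of numeric measurement columns
continuous_cols = ["rent price", "sqfeet", "beds", "baths"]

summary = toy[continuous_cols].describe()
print(summary)
```

This would also bring baths into the same table instead of a separate float summary.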
Here we are displaying the summary of our categorical features.
display(HTML('<b>Table 2: Summary of categorical features</b>'))
Finaldf.describe(include='object')
| | type | Laundry options | parking options |
|---|---|---|---|
| count | 3065 | 3065 | 3065 |
| unique | 10 | 5 | 7 |
| top | apartment | w/d in unit | off-street parking |
| freq | 2401 | 1388 | 1579 |
Here we can see the summary of our float features.
display(HTML('<b>Table 3: Summary of Float features</b>'))
Finaldf.describe(include='float64')
| | baths |
|---|---|
| count | 3065.000000 |
| mean | 1.489886 |
| std | 0.589338 |
| min | 0.000000 |
| 25% | 1.000000 |
| 50% | 1.000000 |
| 75% | 2.000000 |
| max | 5.500000 |
Below we are highlighting the unique values that are present in our categorical variables.
categoricalColumns = Finaldf.columns[Finaldf.dtypes==object].tolist()
for col in categoricalColumns:
    # print the distinct values of each categorical column
    print('Unique values for ' + col)
    print(Finaldf[col].unique())
    print('')
Unique values for type
['apartment' 'condo' 'house' 'manufactured' 'duplex' 'townhouse' 'loft'
 'in-law' 'cottage/cabin' 'flat']

Unique values for Laundry options
['w/d in unit' 'laundry on site' 'laundry in bldg' 'w/d hookups'
 'no laundry on site']

Unique values for parking options
['off-street parking' 'carport' 'detached garage' 'street parking'
 'attached garage' 'no parking' 'valet parking']
Here, we use the groupby function to count the parking options available for each type of house.
For example, among apartments, 1317 listings have off-street parking, 470 have a carport, and so on.
Finaldf.groupby('type')['parking options'].value_counts()
type parking options
apartment off-street parking 1317
carport 470
attached garage 254
detached garage 187
street parking 152
no parking 19
valet parking 2
condo off-street parking 42
attached garage 25
carport 12
detached garage 6
street parking 2
cottage/cabin off-street parking 10
duplex off-street parking 23
attached garage 17
street parking 10
carport 6
detached garage 5
flat off-street parking 4
attached garage 1
no parking 1
house attached garage 151
off-street parking 88
carport 22
detached garage 20
street parking 9
no parking 1
in-law off-street parking 1
loft off-street parking 7
attached garage 4
street parking 1
manufactured off-street parking 37
carport 7
street parking 3
townhouse attached garage 63
off-street parking 50
carport 20
detached garage 9
street parking 7
Name: parking options, dtype: int64
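The same counts can also be laid out as a two-way table, which some readers may find easier to scan than the long listing above. The sketch below uses pd.crosstab on a made-up toy frame; the data values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for Finaldf (values are made up)
toy = pd.DataFrame({
    "type": ["apartment", "apartment", "apartment", "house", "house", "condo"],
    "parking options": ["off-street parking", "carport", "off-street parking",
                        "attached garage", "off-street parking", "carport"],
})

# Rows are house types, columns are parking options, cells are listing counts
table = pd.crosstab(toy["type"], toy["parking options"])
print(table)
```

Combinations that never occur show up as 0 in the table, whereas groupby with value_counts simply omits them.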
Here, we use the groupby function to count the laundry options available for each type of house.
For example, among apartments, 1068 listings have w/d in unit, 512 have laundry on site, and so on.
Finaldf.groupby('type')['Laundry options'].value_counts()
type Laundry options
apartment w/d in unit 1068
laundry on site 512
w/d hookups 401
laundry in bldg 381
no laundry on site 39
condo w/d in unit 66
laundry in bldg 8
w/d hookups 8
laundry on site 5
cottage/cabin laundry on site 5
w/d in unit 3
no laundry on site 1
w/d hookups 1
duplex w/d in unit 25
w/d hookups 21
laundry on site 7
no laundry on site 5
laundry in bldg 3
flat w/d in unit 3
laundry on site 2
no laundry on site 1
house w/d in unit 127
w/d hookups 124
laundry in bldg 17
laundry on site 17
no laundry on site 6
in-law w/d in unit 1
loft w/d in unit 9
laundry in bldg 2
w/d hookups 1
manufactured w/d hookups 32
w/d in unit 14
laundry in bldg 1
townhouse w/d in unit 72
w/d hookups 58
laundry on site 12
laundry in bldg 6
no laundry on site 1
Name: Laundry options, dtype: int64
Below we select all the boolean variables in our dataset. Boolean variables are those whose values mean "yes" or "no"; in this dataset they are recorded as 1 and 0.
Here the boolean variables are cats allowed, dogs allowed, smoking allowed, wheelchair access, electric vehicle charge and furnished.
bool_vars = [var for var in Finaldf if Finaldf[var].nunique() == 2]
Finaldf[bool_vars].head()
| | cats allowed | dogs allowed | smoking allowed | wheelchair access | electric vehicle charge | furnished |
|---|---|---|---|---|---|---|
| 231254 | 1 | 1 | 1 | 0 | 0 | 0 |
| 39715 | 1 | 1 | 0 | 1 | 0 | 0 |
| 61857 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1984 | 1 | 1 | 1 | 0 | 0 | 0 |
| 145290 | 0 | 0 | 1 | 0 | 0 | 0 |
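Selecting columns by nunique() == 2 works for this dataset, but it would also pick up any other column that happens to have exactly two distinct values. A slightly stricter check, sketched below on a made-up toy frame, keeps only columns whose values are all 0 or 1. The names toy, naive and strict are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "cats allowed": [1, 0, 1, 1],
    "furnished": [0, 0, 1, 0],
    "beds": [1, 2, 1, 2],                                   # two values, not a flag
    "type": ["apartment", "house", "house", "apartment"],   # two values, not a flag
})

# Rule used in the report: exactly two distinct values
naive = [col for col in toy.columns if toy[col].nunique() == 2]

# Stricter rule: every non-missing value must be 0 or 1
strict = [col for col in toy.columns
          if set(toy[col].dropna().unique()).issubset({0, 1})]

print(naive)
print(strict)
```

Note that the stricter rule would also accept a flag column that is all 0 or all 1 in a given sample.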
Now we present the data visualizations using one, two and three variables, with two plots of each kind.
Below is a boxplot (Fig 1) showing the distribution of the rent price. We use plotly to create the boxplot.
fig1 = px.box(Finaldf, y="rent price",
title='Boxplot showing Average rent price')
fig1
The median rent shown by the box plot is $1102. This is not the mean rent (about $1291 in Table 1), because the boxplot clearly shows many outliers with high rent that pull the mean upwards. We could remove these outliers, but we will not: they are real rent prices that happen to be high, and removing them would alter our dataset and our later predictions.
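To make "outlier" concrete, the usual boxplot convention (Tukey's rule) flags points more than 1.5 times the interquartile range beyond the quartiles. The short calculation below applies that rule to the rent price quartiles from Table 1. It is only a sketch: plotly may compute quartiles slightly differently, so the whiskers in Fig 1 need not match these numbers exactly.

```python
# Quartiles of rent price, taken from the 25% and 75% rows of Table 1
q1, q3 = 839.0, 1550.0

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # points below this count as outliers
upper_fence = q3 + 1.5 * iqr     # points above this count as outliers

print(iqr, lower_fence, upper_fence)   # 711.0 -227.5 2616.5

# The maximum rent in Table 1 lies far above the upper fence
print(15750 > upper_fence)             # True
```

Under this rule any rent above roughly $2616 would be flagged, while the negative lower fence means no low rent is flagged.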
Fig 2 shows the number of places available for rent for each parking option.
fig2 = px.bar(Finaldf,x='parking options',hover_data=['parking options'],
title='Barchart showing Different places available for rent available as per parking options')
fig2.show()
Fig 2 displays how many places to live are available under each parking option.
Fig 3 is a scatter plot which highlights the relationship between rent prices and the size of place.
fig3 = px.scatter(Finaldf, x='sqfeet', y='rent price',
title='Scatter plot showing relationship between rent prices and size of place')
fig3.show()
Although it is not clearly visible, there appears to be a positive linear relationship between the rent price and the sqfeet of the place, with exceptions. For example, one place of 1189 sqfeet has a rent of USD 489, while one of 3220 sqfeet has a rent of USD 5000. In other words, a place with a larger area (sqfeet) tends to have a higher price than a place with a smaller area.
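Since the relationship is hard to judge by eye, a correlation coefficient can back it up with a number. The sketch below computes the Pearson correlation on a made-up toy frame; the values are hypothetical and the result says nothing about the real data.

```python
import pandas as pd

# Hypothetical toy listings (values are made up)
toy = pd.DataFrame({
    "sqfeet": [500, 750, 1000, 1250, 1500],
    "rent price": [700, 950, 1050, 1400, 1450],
})

# Series.corr uses the Pearson correlation by default
r = toy["sqfeet"].corr(toy["rent price"])
print(round(r, 3))
```

A value close to +1 indicates a strong positive linear relationship, while a value near 0 indicates little linear relationship. On the real data, the extreme rents and areas visible in Table 1 may weaken the coefficient.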
In Fig 4, a boxplot displays the relationship between the number of bedrooms in an accommodation and its price by showing the typical price for each bedroom count.
fig4 = px.box(Finaldf, x='beds',y='rent price',
title='Boxplot showing Average price as per Number of beds available')
fig4.show()
In Fig 4, we can see that a place with more bedrooms tends to have a higher price. In other words, as the number of bedrooms in a place increases, its price also increases. A place with 0 bedrooms has an average rent of 4831, while a place with 1 bedroom has a rent of $970, and so on.
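The per-bedroom figures read off the boxplot can also be tabulated directly. The sketch below groups a made-up toy frame by beds and reports the count and median rent per group; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy listings (values are made up)
toy = pd.DataFrame({
    "beds": [0, 0, 1, 1, 1, 2, 2, 3],
    "rent price": [800, 900, 950, 1000, 970, 1200, 1300, 1700],
})

# Count and median rent for each number of bedrooms
summary = toy.groupby("beds")["rent price"].agg(["count", "median"])
print(summary)
```

Showing the count next to the median helps when a group has few listings, because a median based on a handful of rows can be far from typical.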
Fig 5 contains a scatter plot which shows the price of different houses as per their parking options.
fig5 = px.scatter(Finaldf,x='type', y='rent price', color='parking options',
title='Scatter plot showing price of different types of houses as per their parking options')
fig5.show()
In Fig 6, we can see a bar chart showing the prices of houses by their laundry options and number of bedrooms.
fig6 = px.bar(Finaldf, y='rent price',x='Laundry options',color='beds',
title='Barchart showing price of different apartments as per their Laundry options and bedrooms')
fig6.show()
We can see that as the number of bedrooms increases, the prices of the places also increase, and this holds for every laundry option present.
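As far as we understand plotly express, px.bar draws one bar segment per row and stacks them, so the bar heights in Fig 6 reflect summed rent rather than a typical rent. One way to compare typical prices directly is a small pivot table of the median rent per laundry option and bedroom count. The sketch below uses a made-up toy frame; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy listings (values are made up)
toy = pd.DataFrame({
    "Laundry options": ["w/d in unit", "w/d in unit", "w/d in unit",
                        "laundry on site", "laundry on site", "laundry on site"],
    "beds": [1, 2, 2, 1, 1, 2],
    "rent price": [1000, 1400, 1600, 800, 900, 1100],
})

# Median rent for every combination of laundry option and bedroom count
median_rent = toy.pivot_table(index="Laundry options", columns="beds",
                              values="rent price", aggfunc="median")
print(median_rent)
```

Reading across a row shows how rent changes with bedrooms for one laundry option.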
The main aim of this report is to import the dataset, describe the variables available in it, clean the dataset for future use, and show descriptive visualizations of the variables.
First we provide a descriptive summary of the features available in the dataset, and then perform data pre-processing, starting by importing the dataset.
After importing, we check the shape of the dataset, which shows that it has 265190 rows and 22 columns. We then take a subset of the dataset containing only 5000 rows and look at the data types of the variables.
Next, we remove the rows with missing values, as they can cause issues later. Once the missing values are removed, 3065 rows and 22 columns remain in the dataset. We then drop the irrelevant columns, rename the remaining ones, and name the dataset Finaldf.
Then we look at the summary of all variables in the dataset and the unique values present in the categorical columns.
Then we begin the visualizations. First we show a boxplot of the rent price, which shows that the median rent is $1102; this is not the mean rent, as the boxplot clearly shows many outliers with high rent.
Fig 2 shows the number of places available for rent for each parking option.
Next come the visualizations using two variables: first a scatter plot showing a positive linear relationship between rent price and size of place, then a boxplot which shows that as the number of beds increases, the price of accommodation also increases.
Then we proceed to visualizations using three variables. Fig 5 is a scatter plot showing the price of different types of houses by their parking options, and Fig 6 is a bar chart showing the prices for each laundry option as the number of bedrooms increases.
That is all we have done in this report (Phase 1).
We will then proceed to the next phase, which is the prediction of rent.